We were given a data set from the National Health and Nutrition Examination Survey (NHANES). According to CDC (2023), the program has been conducted since the 1960s as a series of surveys designed to assess the health and nutritional status of adults and children in the United States. It collects data through in-person interviews and physical examinations of participants.
The survey data were not a simple random sample, however. According to CDC’s National Health and Nutrition Examination Survey: Plan and Operations, 1999–2010 (G et al. 2013), the sampling strategy consists of four stages: (1) selection of counties as primary sampling units (PSUs); (2) selection of segments within PSUs that constitute blocks of households; (3) selection of specific households within segments; and (4) selection of individuals within a household.
We aim to study the relationship between the weight variable and the other health-related variables in the data.
We began with an exploratory analysis of the variables using tables and charts. We then performed hypothesis tests on several of the variables. Lastly, we fit a linear regression model with “weight” as the response and the other variables and confounders as predictors.
We began our analysis by compiling a data dictionary, shown in Table 1 below. Some variables have a relatively high percentage of missing values. In Part 2 we used hypothesis tests to decide whether some of these variables could be excluded from the regression analysis in Part 3.
Weight is a continuous variable in our data. A simple way to categorize it is through the body mass index (BMI). The data set already contains an obese variable, which dichotomizes BMI at a threshold of 35: a person is classified as obese if the BMI is above 35, and as not obese otherwise. We therefore used the obese variable as the categorical counterpart of weight in our project.
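The thresholding described above can be sketched in R as follows. The `bmi` values are made up for illustration, and the strict inequality follows the data dictionary entry `BMI>35`; this is an assumption about how the `obese` variable was derived, not the survey's own code.

```r
# Illustrative BMI values (hypothetical, not taken from the NHANES data)
bmi <- c(17.9, 24.0, 35.0, 41.5, NA)

# Binary obesity indicator: "Yes" if BMI > 35, "No" otherwise.
# A missing BMI stays missing (NA) rather than being coded as "No".
obese <- factor(ifelse(bmi > 35, "Yes", "No"), levels = c("No", "Yes"))

table(obese, useNA = "ifany")
```

Note that a BMI of exactly 35 is coded as "No" under this strict inequality.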
From Table 1 we can see that our data set contains 6445 observations of 21 variables, 8 of which are categorical. The original data set contained 6482 observations, 37 of which had no value for the bmi variable. Because these account for less than 0.6% of the data, we deleted them and stratified the remaining observations by BMI level.
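The arithmetic behind these counts, and the row filter itself, can be sketched as below. The counts 6482 and 37 come from the text; the data frame `dat` is a made-up stand-in for the survey data and is only there to show the filtering step.

```r
n_original <- 6482   # observations in the original data set
n_missing  <- 37     # observations with a missing bmi value

n_analysis  <- n_original - n_missing        # 6445 observations kept
pct_missing <- 100 * n_missing / n_original  # about 0.57 percent

# Toy data frame standing in for the survey data (values are made up)
dat <- data.frame(id = 1:5, bmi = c(21.4, NA, 30.2, 36.8, NA))

# Keep only the rows where bmi was observed
dat_complete <- dat[!is.na(dat$bmi), ]
nrow(dat_complete)
```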
| | Healthy weight | Obesity | Overweight | Underweight | Overall |
|---|---|---|---|---|---|
| | (N=1883) | (N=2311) | (N=2127) | (N=124) | (N=6445) |
| Gender | | | | | |
| Male | 897 (47.6%) | 1036 (44.8%) | 1171 (55.1%) | 40 (32.3%) | 3144 (48.8%) |
| Female | 986 (52.4%) | 1275 (55.2%) | 956 (44.9%) | 84 (67.7%) | 3301 (51.2%) |
| Age (years) | | | | | |
| Mean (SD) | 41.2 (20.6) | 48.7 (17.7) | 48.9 (19.0) | 37.9 (21.0) | 46.4 (19.4) |
| Median [Min, Max] | 37.0 [16.0, 80.0] | 49.0 [16.0, 80.0] | 49.0 [16.0, 80.0] | 30.0 [16.0, 80.0] | 46.0 [16.0, 80.0] |
| Marital Status | | | | | |
| Married | 741 (39.4%) | 1158 (50.1%) | 1074 (50.5%) | 31 (25.0%) | 3004 (46.6%) |
| Widowed | 121 (6.4%) | 185 (8.0%) | 190 (8.9%) | 8 (6.5%) | 504 (7.8%) |
| Divorced | 154 (8.2%) | 262 (11.3%) | 210 (9.9%) | 14 (11.3%) | 640 (9.9%) |
| Separated | 47 (2.5%) | 82 (3.5%) | 63 (3.0%) | 1 (0.8%) | 193 (3.0%) |
| Never Married | 351 (18.6%) | 353 (15.3%) | 289 (13.6%) | 30 (24.2%) | 1023 (15.9%) |
| Living Together | 141 (7.5%) | 148 (6.4%) | 157 (7.4%) | 8 (6.5%) | 454 (7.0%) |
| Missing | 328 (17.4%) | 123 (5.3%) | 144 (6.8%) | 32 (25.8%) | 627 (9.7%) |
| Statistical Weight | | | | | |
| Mean (SD) | 36700 (26000) | 33000 (25100) | 34200 (26300) | 37400 (27800) | 34600 (25800) |
| Median [Min, Max] | 26100 [5050, 154000] | 23200 [4080, 124000] | 23600 [4450, 141000] | 26500 [6840, 113000] | 24200 [4080, 154000] |
| Pseudo-PSU | | | | | |
| Mean (SD) | 1.51 (0.500) | 1.50 (0.500) | 1.51 (0.500) | 1.50 (0.502) | 1.51 (0.500) |
| Median [Min, Max] | 2.00 [1.00, 2.00] | 2.00 [1.00, 2.00] | 2.00 [1.00, 2.00] | 1.50 [1.00, 2.00] | 2.00 [1.00, 2.00] |
| Pseudo-stratum | | | | | |
| Mean (SD) | 7.11 (4.09) | 7.36 (4.13) | 7.15 (4.16) | 7.80 (4.14) | 7.22 (4.13) |
| Median [Min, Max] | 7.00 [1.00, 15.0] | 7.00 [1.00, 15.0] | 7.00 [1.00, 15.0] | 8.00 [1.00, 15.0] | 7.00 [1.00, 15.0] |
| Total Cholesterol (mg/dL) | | | | | |
| Mean (SD) | 185 (39.9) | 194 (40.5) | 198 (42.8) | 172 (33.4) | 192 (41.4) |
| Median [Min, Max] | 180 [92.0, 383] | 191 [92.0, 357] | 194 [90.0, 380] | 166 [108, 289] | 189 [90.0, 383] |
| Missing | 123 (6.5%) | 142 (6.1%) | 121 (5.7%) | 6 (4.8%) | 392 (6.1%) |
| HDL-Cholesterol (mg/dL) | | | | | |
| Mean (SD) | 58.4 (17.1) | 47.6 (13.7) | 51.8 (15.5) | 63.3 (17.1) | 52.5 (16.0) |
| Median [Min, Max] | 56.0 [11.0, 144] | 46.0 [15.0, 115] | 50.0 [16.0, 119] | 63.0 [26.0, 114] | 50.0 [11.0, 144] |
| Missing | 124 (6.6%) | 142 (6.1%) | 120 (5.6%) | 6 (4.8%) | 392 (6.1%) |
| Systolic Blood Pressure (mm Hg) | | | | | |
| Mean (SD) | 119 (18.5) | 125 (17.3) | 125 (18.5) | 111 (18.5) | 123 (18.3) |
| Median [Min, Max] | 116 [90.0, 220] | 124 [90.0, 200] | 122 [90.0, 208] | 106 [90.0, 220] | 120 [90.0, 220] |
| Missing | 164 (8.7%) | 206 (8.9%) | 154 (7.2%) | 20 (16.1%) | 544 (8.4%) |
| Diastolic Blood Pressure (mm Hg) | | | | | |
| Mean (SD) | 67.4 (11.2) | 71.3 (12.4) | 69.8 (11.8) | 65.7 (11.3) | 69.6 (11.9) |
| Median [Min, Max] | 68.0 [40.0, 118] | 72.0 [40.0, 134] | 70.0 [40.0, 118] | 66.0 [44.0, 110] | 70.0 [40.0, 134] |
| Missing | 167 (8.9%) | 230 (10.0%) | 170 (8.0%) | 18 (14.5%) | 585 (9.1%) |
| Weight (kg) | | | | | |
| Mean (SD) | 63.1 (9.13) | 99.0 (17.7) | 77.3 (10.3) | 47.9 (5.56) | 80.4 (20.2) |
| Median [Min, Max] | 62.8 [38.5, 95.5] | 96.9 [57.8, 159] | 76.8 [45.5, 117] | 47.7 [33.2, 63.0] | 77.6 [33.2, 159] |
| Standing Height (cm) | | | | | |
| Mean (SD) | 168 (10.0) | 167 (10.4) | 168 (10.4) | 166 (7.59) | 167 (10.2) |
| Median [Min, Max] | 167 [140, 203] | 166 [135, 196] | 168 [123, 202] | 165 [147, 186] | 167 [123, 203] |
| Vigorous Work Activity | | | | | |
| Yes | 324 (17.2%) | 418 (18.1%) | 371 (17.4%) | 16 (12.9%) | 1129 (17.5%) |
| No | 1558 (82.7%) | 1893 (81.9%) | 1756 (82.6%) | 108 (87.1%) | 5315 (82.5%) |
| Missing | 1 (0.1%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (0.0%) |
| Moderate Work Activity | | | | | |
| Yes | 651 (34.6%) | 796 (34.4%) | 701 (33.0%) | 32 (25.8%) | 2180 (33.8%) |
| No | 1231 (65.4%) | 1515 (65.6%) | 1426 (67.0%) | 92 (74.2%) | 4264 (66.2%) |
| Missing | 1 (0.1%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (0.0%) |
| Walk or Bicycle | | | | | |
| Yes | 630 (33.5%) | 549 (23.8%) | 573 (26.9%) | 48 (38.7%) | 1800 (27.9%) |
| No | 1252 (66.5%) | 1762 (76.2%) | 1554 (73.1%) | 76 (61.3%) | 4644 (72.1%) |
| Missing | 1 (0.1%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (0.0%) |
| Vigorous Recreational Activities | | | | | |
| Yes | 579 (30.7%) | 344 (14.9%) | 449 (21.1%) | 27 (21.8%) | 1399 (21.7%) |
| No | 1303 (69.2%) | 1967 (85.1%) | 1678 (78.9%) | 97 (78.2%) | 5045 (78.3%) |
| Missing | 1 (0.1%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (0.0%) |
| Moderate Recreational Activities | | | | | |
| Yes | 834 (44.3%) | 791 (34.2%) | 823 (38.7%) | 37 (29.8%) | 2485 (38.6%) |
| No | 1048 (55.7%) | 1520 (65.8%) | 1303 (61.3%) | 87 (70.2%) | 3958 (61.4%) |
| Missing | 1 (0.1%) | 0 (0%) | 1 (0.0%) | 0 (0%) | 2 (0.0%) |
| Minutes of Sedentary Activity per Week (mins) | | | | | |
| Mean (SD) | 316 (185) | 333 (186) | 308 (184) | 366 (195) | 321 (186) |
| Median [Min, Max] | 300 [0, 840] | 300 [0, 840] | 300 [1.00, 840] | 300 [10.0, 840] | 300 [0, 840] |
| Missing | 17 (0.9%) | 34 (1.5%) | 26 (1.2%) | 1 (0.8%) | 78 (1.2%) |
| Obese | | | | | |
| No | 1883 (100%) | 1325 (57.3%) | 2127 (100%) | 124 (100%) | 5459 (84.7%) |
| Yes | 0 (0%) | 986 (42.7%) | 0 (0%) | 0 (0%) | 986 (15.3%) |
| Variable | Type | Example | Unique values | Missing | Comment |
|---|---|---|---|---|---|
| id | integer | 1, 2, 3 | 6482 | 0% | Identification Code (1 - 6482) |
| gender | factor | Male, Female | 2 | 0% | Gender (1: Male, 2: Female) |
| age | integer | 34, 16, 60 | 65 | 0% | Age (Years) |
| marstat | factor | Married, NA, Widowed | 6 | 9.7% | Marital Status (1: Married, 2: Widowed, 3: Divorced, 4: Separated, 5: Never Married, 6: Living Together) |
| samplewt | numeric | 80100.544, 13953.078, 20090.339 | 2499 | 0% | Statistical Weight (4084.478 - 153810.3) |
| psu | integer | 1, 2 | 2 | 0% | Pseudo-PSU (1, 2) |
| strata | integer | 9, 10, 1 | 15 | 0% | Pseudo-Stratum (1 - 15) |
| tchol | integer | 135, 192, 202 | 251 | 6.09% | Total Cholesterol (mg/dL) |
| hdl | integer | 50, 60, 45 | 112 | 6.09% | HDL-Cholesterol (mg/dL) |
| sysbp | integer | 114, 112, 154 | 61 | 8.53% | Systolic Blood Pressure (mm Hg) |
| dbp | integer | 88, 62, 70 | 40 | 9.16% | Diastolic Blood Pressure (mm Hg) |
| wt | numeric | 87.400002, 72.300003, 116.8 | 957 | 0.57% | Weight (kg) |
| ht | numeric | 164.7, 181.3, 166 | 527 | 0.57% | Standing Height (cm) |
| bmi | numeric | 32.22, 22, 42.39 | 2276 | 0.57% | Body Mass Index (kg/m^2) |
| vigwrk | factor | No, Yes, NA | 2 | 0.02% | Vigorous Work Activity (1: Yes, 2: No) |
| modwrk | factor | No, Yes, NA | 2 | 0.02% | Moderate Work Activity (1: Yes, 2: No) |
| wlkbik | factor | No, Yes, NA | 2 | 0.02% | Walk or Bicycle (1: Yes, 2: No) |
| vigrecexr | factor | No, Yes, NA | 2 | 0.02% | Vigorous Recreational Activities (1: Yes, 2: No) |
| modrecexr | factor | No, Yes, NA | 2 | 0.03% | Moderate Recreational Activities (1: Yes, 2: No) |
| sedmin | integer | 480, 240, 720 | 37 | 1.22% | Minutes of Sedentary Activity per Week (0 - 840) |
| obese | factor | No, Yes, NA | 2 | 0.57% | BMI>35 (1: No, 2: Yes) |
According to the CDC’s classification of body weight, a BMI below 18.5 is Underweight, a BMI from 18.5 to 24.9 is Healthy weight, a BMI from 25 to 29.9 is Overweight, and a BMI of 30 or above is Obesity. We adopted these categories and found a slight positive relationship between body weight category and total cholesterol level, whereas HDL decreased as body weight category increased. Because total cholesterol consists largely of HDL and LDL, this pattern suggests that the obese population tends to have a high level of LDL and a low level of HDL.
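A minimal sketch of this four-level classification in base R is given below. The `bmi` values are hypothetical, and the cut points assume the usual CDC convention that each category includes its lower bound (so a BMI of exactly 30 counts as Obesity); the report's own code may have handled the boundaries differently.

```r
# Illustrative BMI values (hypothetical), chosen to sit on the boundaries
bmi <- c(17.0, 18.5, 24.9, 25.0, 29.9, 30.0, 41.2)

# right = FALSE makes each interval closed on the left:
# (-Inf, 18.5), [18.5, 25), [25, 30), [30, Inf)
bmi_cat <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Healthy weight", "Overweight", "Obesity"),
  right  = FALSE
)

data.frame(bmi, bmi_cat)
```

Group summaries such as the mean HDL per category can then be computed by splitting on `bmi_cat`.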
According to ATPIII (n.d.), we can also categorize the cholesterol level.
This data set covers participants between 16 and 80 years old. At every age, the average weight of males is greater than that of females, and the line chart shows that average weight follows the same trend with age for both genders: a sustained increase, followed by fluctuation, and finally a continuous decrease. This suggests that weight may be related to age.
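The quantity plotted in such a line chart is the mean weight within each gender-by-age cell. A base R sketch of that summary is shown below; the data frame is made up, and only the column names `gender`, `age` and `wt` are taken from the data dictionary.

```r
# Toy data (made up) using the report's variable names
dat <- data.frame(
  gender = c("Male", "Male", "Male", "Female", "Female", "Female"),
  age    = c(20, 20, 50, 20, 50, 50),
  wt     = c(70, 80, 90, 60, 70, 74)
)

# Mean weight for each combination of gender and age
avg_wt <- aggregate(wt ~ gender + age, data = dat, FUN = mean)
avg_wt
```

Each row of `avg_wt` corresponds to one point on the line for that gender.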
We were also interested in the relationship between weight and marital status. The following box plot shows that the median weight is around 80 kg for every marital status, with widowed participants having the lowest weight among the six categories. The Married and Never Married categories contain more participants heavier than 130 kg than the other categories, which may be a health concern.
The different types of work and recreational activity are also interesting variables to discuss. Each of these four variables has only one missing value, so we omitted that observation directly. For vigorous activities, participants who do neither vigorous work nor vigorous recreational activities show both the lowest and the highest BMI values. Among participants who do vigorous recreational activities, the upper quartile of BMI is below 30 and the lower quartile is above 20, regardless of work activity. This means that the majority of this group has a BMI below the obesity threshold of 30. For moderate activities, participants who do neither moderate work nor moderate recreational activities have the highest BMI, and participants who do only moderate work activities have the lowest BMI. Among participants who do moderate recreational activities, the upper quartile is slightly above 30 and the lower quartile is above 20, regardless of work activity. In general, the two plots show that participants who do vigorous or moderate recreational activities tend to have a healthier BMI than others, and that, for a given level of recreational activity, participants who do moderate or vigorous work activities less often have a high BMI. Hence, the type of work or recreational activity may be related to weight and, through it, to BMI.
| Marital Status | Obese: No | Obese: Yes |
|---|---|---|
| Married | 2530 | 474 |
| Widowed | 418 | 86 |
| Divorced | 528 | 112 |
| Separated | 158 | 35 |
| Never Married | 863 | 160 |
| Living Together | 388 | 66 |
Let X be the categorical random variable for Marital Status, with levels \(i = 1, ..., I\), and Y the one for Obesity, with levels \(j = 1, ..., J\). Assume a random sample of n trials. Define the count random variable \(N_{ij}:=\sum_{k=1}^n \mathbf{I}_k(X=i, Y=j)\), where \(\mathbf{I}_k\) is the indicator function for the k-th trial. Then the joint random vector \([N_{11}, ..., N_{IJ}]\) has a Multinomial distribution with n trials and cell probabilities \(\vec{p}=[p_{11}, ..., p_{IJ}]\). Writing \(p_{i+}=\sum_{j} p_{ij}\) and \(p_{+j}=\sum_{i} p_{ij}\) for the marginal probabilities, our hypothesis test is therefore:
\[\begin{gather*} H_0: p_{ij}= p_{i+} \cdot p_{+j} ~ \forall i,j\\ H_1:p_{ij} \neq p_{i+} \cdot p_{+j} ~ \text{for some } i,j \end{gather*}\]
The chi-squared test of independence gives a p-value of 0.6894, so there is not enough evidence to reject the null hypothesis. In other words, we cannot conclude that there is a relationship between obesity and marital status.
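As a sketch of how this test can be run, the counts from the marital status by obesity table above can be passed to base R's `chisq.test()`. The object names `obs` and `res` are our own, and we assume the report used the standard Pearson test without further adjustment (for example, without the survey weights).

```r
# Observed counts from the marital status by obesity table
obs <- matrix(
  c(2530, 474,
     418,  86,
     528, 112,
     158,  35,
     863, 160,
     388,  66),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    marstat = c("Married", "Widowed", "Divorced", "Separated",
                "Never Married", "Living Together"),
    obese   = c("No", "Yes")
  )
)

res <- chisq.test(obs)   # Pearson chi-squared test of independence

res$expected    # expected counts under H0: n * p_i+ * p_+j
res$parameter   # degrees of freedom: (6 - 1) * (2 - 1) = 5
res$p.value     # should be close to the reported 0.6894
```

The expected counts are the row total times the column total divided by the grand total, which is the sample version of \(p_{i+} \cdot p_{+j}\) in the null hypothesis.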
We ran the same test for obesity against each of the other categorical variables. Table 2 shows that we can reject independence between obesity and the wlkbik, vigrecexr and modrecexr variables, but not between obesity and vigwrk or modwrk.
| | vigwrk | modwrk | wlkbik | vigrecexr | modrecexr |
|---|---|---|---|---|---|
| p-value | 0.5695 | 0.3037 | 1.064e-07 | 4.061e-15 | 2.573e-09 |